{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Llama model pre-training on Intel Gaudi\n",
    "\n",
    "<a id=\"try-anyscale-quickstart-intel_gaudi-llama_pretrain\" href=\"https://console.anyscale.com/register/ha?render_flow=ray&utm_source=ray_docs&utm_medium=docs&utm_campaign=intel_gaudi-llama_pretrain\">\n",
    "    <img src=\"../../../_static/img/run-on-anyscale.svg\" alt=\"try-anyscale-quickstart\">\n",
    "</a>\n",
    "<br></br>\n",
    "\n",
    "In this Jupyter notebook, we will pre-train a [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b) model by using Intel Gaudi accelerators.\n",
    "\n",
    "We will use PyTorch for model training and Ray for distributed training.\n",
    "\n",
    "[Intel Gaudi AI Processors (HPUs)](https://habana.ai) are AI hardware accelerators designed by Habana Labs. For more information, see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/index.html) and [Gaudi Developer Docs](https://developer.habana.ai/).\n",
    "\n",
    "Basic features for this pre-training example are:\n",
    "- Running on HPUs, support three execution mode: [\"lazy\", \"eager\", \"eager.compile\"](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html).\n",
    "- Pre-training llama model use configuration [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b)\n",
    "- [`GaudiTrainer`](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/trainer.py) based training.\n",
    "- DeepSpeed based pre-training.\n",
    "- Ray based resource scheduling and management."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare environment\n",
    "This example run on single node with 4 HPUs.\n",
    "\n",
    "We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See [Install Docker Engine](https://docs.docker.com/engine/install/) for installation instructions.\n",
    "\n",
    "Next, follow [Run Using Containers](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html?highlight=installer#run-using-containers) to install the Habana drivers and container runtime.\n",
    "\n",
    "### Get docker image\n",
    "``` bash\n",
    "# more available docker image can be found here: https://vault.habana.ai/ui/native/gaudi-docker\n",
    "docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest\n",
    "```\n",
    "### Run docker image\n",
    "``` bash\n",
    "docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest\n",
    "# maybe should mapping your workspace volumns\n",
    "```\n",
    "### Install dependency\n",
    "``` bash\n",
    "# \"optimum-habana>1.11.1\" if exection mode \"eager\" or \"eager.compile\" \n",
    "# \"ray>=2.20.0\"\n",
    "pip install ray[train] notebook transformers datasets evaluate peft accelerate scikit-learn optimum-habana\n",
    "\n",
    "# install deepspeed\n",
    "pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.20.0\n",
    "\n",
    "# this notebook verfied with packages' version:\n",
    "# transformers==4.45.2\n",
    "# datasets==3.3.2\n",
    "# evaluate==0.4.3\n",
    "# peft==0.14.0\n",
    "# accelerate==0.33.0\n",
    "# scikit-learn==1.6.1\n",
    "# optimum-habana==1.15.0\n",
    "\n",
    "# deepspeed==0.16.1+hpu.synapse.v1.20.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "\n",
    "import os\n",
    "from typing import Any, Dict\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "import transformers\n",
    "from itertools import chain\n",
    "from datasets import load_dataset\n",
    "from transformers import default_data_collator\n",
    "from transformers.testing_utils import CaptureLogger\n",
    "from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments\n",
    "from optimum.habana.utils import set_seed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build datasets\n",
    "\n",
    "Download and load dataset from huggingface.co"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_datasets(config):\n",
    "    dataset_name = config[\"name\"] \n",
    "    dataset_config_name = config[\"config_name\"]\n",
    "\n",
    "    # Downloading and loading a dataset from the hub.\n",
    "    raw_datasets = load_dataset(\n",
    "        dataset_name,\n",
    "        dataset_config_name,\n",
    "        cache_dir=None,\n",
    "        token=None,\n",
    "        streaming=False,\n",
    "    )\n",
    "    if \"validation\" not in raw_datasets.keys():\n",
    "        raw_datasets[\"validation\"] = load_dataset(\n",
    "            dataset_name,\n",
    "            dataset_config_name,\n",
    "            split=f\"train[:{data_args.validation_split_percentage}%]\",\n",
    "            cache_dir=None,\n",
    "            token=None,\n",
    "            streaming=False,\n",
    "        )\n",
    "        raw_datasets[\"train\"] = load_dataset(\n",
    "            dataset_name,\n",
    "            dataset_config_name,\n",
    "            split=f\"train[{data_args.validation_split_percentage}%:]\",\n",
    "            cache_dir=None,\n",
    "            token=None,\n",
    "            streaming=False,\n",
    "        )\n",
    "\n",
    "    return raw_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load tokenizer\n",
    "\n",
    "Download vocabulary from huggingface.co."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_tokenizer(config):\n",
    "    name = config[\"name\"]\n",
    "    tokenizer_kwargs = {\n",
    "        \"cache_dir\": None,\n",
    "        \"use_fast\": True,\n",
    "        \"revision\": \"main\",\n",
    "        \"token\": None,\n",
    "        \"trust_remote_code\": False,\n",
    "    }\n",
    "    return transformers.AutoTokenizer.from_pretrained(name, **tokenizer_kwargs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tokenize dataset\n",
    "\n",
    "tokenize word to token ids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize_dataset(datasets, tokenizer):\n",
    "    column_names = list(datasets[\"train\"].features)\n",
    "    text_column_name = \"text\" if \"text\" in column_names else column_names[0]\n",
    "\n",
    "    tok_logger = transformers.utils.logging.get_logger(\"transformers.tokenization_utils_base\")\n",
    "\n",
    "    def tokenize_function(examples):\n",
    "        with CaptureLogger(tok_logger) as cl:\n",
    "            output = tokenizer(examples[text_column_name])\n",
    "        # clm input could be much much longer than block_size\n",
    "        if \"Token indices sequence length is longer than the\" in cl.out:\n",
    "            tok_logger.warning(\n",
    "                \"^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits\"\n",
    "                \" before being passed to the model.\"\n",
    "            )\n",
    "        return output\n",
    "\n",
    "    tokenized_datasets = datasets.map(\n",
    "        tokenize_function,\n",
    "        batched=True,\n",
    "        num_proc=None,\n",
    "        remove_columns=column_names,\n",
    "        load_from_cache_file=True,\n",
    "        desc=\"Running tokenizer on dataset\",\n",
    "    )\n",
    "\n",
    "    return tokenized_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group dataset\n",
    "\n",
    "This preprocssing will concatenate all texts from our dataset and generate chunks of block_size, and will pre-train model much faster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def group_dataset(config, datasets, tokenizer):\n",
    "    config_name = config[\"name\"]\n",
    "    auto_config = transformers.AutoConfig.from_pretrained(config_name)\n",
    "    max_pos_embeddings = auto_config.max_position_embeddings\n",
    "    block_size = tokenizer.model_max_length\n",
    "    if block_size > max_pos_embeddings:\n",
    "        print(\n",
    "            f\"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). \"\n",
    "            f\"Using block_size={min(1024, max_pos_embeddings)} instead. You can change that default value by passing --block_size xxx.\"\n",
    "        )\n",
    "        if max_pos_embeddings > 0:\n",
    "            block_size = min(1024, max_pos_embeddings)\n",
    "        else:\n",
    "            block_size = 1024\n",
    "\n",
    "    # Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.\n",
    "    def group_texts(examples):\n",
    "        # Concatenate all texts.\n",
    "        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}\n",
    "        total_length = len(concatenated_examples[list(examples.keys())[0]])\n",
    "        # We drop the small remainder, and if the total_length < block_size  we exclude this batch and return an empty dict.\n",
    "        # We could add padding if the model supported it instead of this drop, you can customize this part to your needs.\n",
    "        total_length = (total_length // block_size) * block_size\n",
    "        # Split by chunks of max_len.\n",
    "        result = {\n",
    "            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]\n",
    "            for k, t in concatenated_examples.items()\n",
    "        }\n",
    "        result[\"labels\"] = result[\"input_ids\"].copy()\n",
    "        return result\n",
    "\n",
    "    lm_datasets = datasets.map(\n",
    "        group_texts,\n",
    "        batched=True,\n",
    "        num_proc=None,\n",
    "        load_from_cache_file=True,\n",
    "        desc=f\"Grouping texts in chunks of {block_size}\",\n",
    "    )\n",
    "    return lm_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load model\n",
    "\n",
    "Download and load pre-configed model from huggingface.co, the detail model configurations in config.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_model(config):\n",
    "    name = config[\"name\"]\n",
    "    model_config = config.get(\"config\", {})\n",
    "    auto_config = transformers.AutoConfig.from_pretrained(\n",
    "        pretrained_model_name_or_path=name, **model_config\n",
    "    )\n",
    "    model = transformers.AutoModelForCausalLM.from_config(auto_config, trust_remote_code=False)\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare trainer\n",
    "\n",
    "Instance Trainer with `model`, `gaudi_config`, `training_args`, `tokenizer`\n",
    "\n",
    "No evaluation dataset passed, just training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_trainer(training_args, datasets, tokenizer, model):\n",
    "    gaudi_config = GaudiConfig.from_pretrained(\n",
    "        training_args.gaudi_config_name, revision=\"main\",\n",
    "    )\n",
    "\n",
    "    trainer = GaudiTrainer(\n",
    "        model=model,\n",
    "        gaudi_config=gaudi_config,\n",
    "        args=training_args,\n",
    "        train_dataset=datasets[\"train\"],\n",
    "        eval_dataset=None,\n",
    "        tokenizer=tokenizer,\n",
    "        data_collator=default_data_collator,\n",
    "    )\n",
    "    return trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Function\n",
    "\n",
    "This function will be executed by each worker during training, with following steps:\n",
    "- prepare GaudiTrainingArguments object.\n",
    "- load datasets from huggingface.co.\n",
    "- load pre-configed tokenizer from huggingface.co.\n",
    "- tokenize dataset with loaded model tokenizer.\n",
    "- concatenate all texts from our dataset and generate chunks of block_size.\n",
    "- instance object of `GaudiTrainer` with training_args, datasets, tokenizer, and model.\n",
    "- call `train` of trainer.\n",
    "- save model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pretrain_llama(config: Dict[str, Any]):\n",
    "\n",
    "    training_args = GaudiTrainingArguments(**config[\"training_args\"])\n",
    "    set_seed(training_args.seed)\n",
    "\n",
    "    raw_datasets = load_datasets(config[\"datasets\"])\n",
    "\n",
    "    tokenizer = load_tokenizer(config[\"tokenizer\"])\n",
    "\n",
    "    tokenized_datasets = tokenize_dataset(raw_datasets, tokenizer)\n",
    "\n",
    "    tokenized_datasets = group_dataset(config[\"model\"], tokenized_datasets, tokenizer)\n",
    "\n",
    "    model = load_model(config[\"model\"])\n",
    "\n",
    "    trainer = get_trainer(training_args, tokenized_datasets, tokenizer, model)\n",
    "\n",
    "    result = trainer.train()\n",
    "    trainer.save_model()\n",
    "    print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Main Training Function\n",
    "\n",
    "The `main` function sets up the distributed training environment using Ray and starts the training process. To enable training using HPU, we only need to make the following changes:\n",
    "- Set the exectuion mode for training, supported execution mode are:\n",
    "\n",
    "    - \"lazy\": Deferred execution of graphs, comprising of ops delivered from script op by op similar to Eager mode. It gives the Eager mode experience with performance on Gaudi. Unlike Eager Mode with torch.compile, graph is analyzed in each iteration leading to a higher CPU usage.\n",
    "    - \"eager\": Op-by-op execution as defined in standard PyTorch Eager mode scripts.\n",
    "    - \"eager.compile\": Eager mode extended with `torch.compile` - Similar to Eager mode but extended with wrapping complete or part of model (such as a function) into a graph. Parts that are not wrapped are executed eagerly.\n",
    "\n",
    "    More detail theory can be found [here](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html), and detail performance results can be found [here](https://developer.habana.ai/get-started/habana-models-performance/)\n",
    "- Require an HPU for each worker in ScalingConfig\n",
    "- Set backend to `hccl` in TorchConfig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def main(num_workers, execution_mode):\n",
    "    import ray\n",
    "    from ray.train import ScalingConfig\n",
    "    from ray.train.torch import TorchTrainer, TorchConfig\n",
    "\n",
    "    pretrain_config = {\n",
    "        \"datasets\": {\n",
    "            \"name\": \"wikitext\",\n",
    "            \"config_name\": \"wikitext-2-raw-v1\",\n",
    "        },\n",
    "        \"tokenizer\": {\n",
    "            \"name\": \"huggyllama/llama-7b\",\n",
    "            \"config\": {}\n",
    "        },\n",
    "        \"model\": {\n",
    "            \"name\": \"huggyllama/llama-7b\",\n",
    "            \"config\": {\n",
    "                \"torch_dtype\": \"bfloat16\",\n",
    "            },\n",
    "        },\n",
    "        \"training_args\": {\n",
    "            \"per_device_train_batch_size\": 1,\n",
    "            \"do_train\": True,\n",
    "            \"save_strategy\": \"no\",\n",
    "            \"output_dir\": \"/tmp/ray/pretrain-llama-2\",\n",
    "            \"logging_steps\": 1,\n",
    "            \"gaudi_config_name\": \"Habana/llama\",\n",
    "            \"use_habana\": True,\n",
    "            \"throughput_warmup_steps\": 3,\n",
    "            \"use_lazy_mode\": True,\n",
    "            \"overwrite_output_dir\": True,\n",
    "            \"seed\": 42,\n",
    "            \"bf16\": True,\n",
    "            \"report_to\":'tensorboard',\n",
    "            \"deepspeed\": {\n",
    "                \"steps_per_print\": 64,\n",
    "                \"train_batch_size\": \"auto\",\n",
    "                \"train_micro_batch_size_per_gpu\": \"auto\",\n",
    "                \"gradient_accumulation_steps\": \"auto\",\n",
    "                \"bf16\": {\n",
    "                    \"enabled\": True\n",
    "                },\n",
    "                \"gradient_clipping\": 1.0,\n",
    "                \"zero_optimization\": {\n",
    "                    \"stage\": 3,\n",
    "                    \"overlap_comm\": False,\n",
    "                    \"reduce_scatter\": False,\n",
    "                    \"contiguous_gradients\": False,\n",
    "                    \"stage3_gather_16bit_weights_on_model_save\": True\n",
    "                }\n",
    "            },\n",
    "        },\n",
    "    }\n",
    "\n",
    "    # if execution mode is eager with compile, must spcified with a compile backend\n",
    "    if execution_mode == \"eager.compile\":\n",
    "        pretrain_config[\"training_args\"].update({\"torch_compile_backend\": \"hpu_backend\"})\n",
    "\n",
    "    scaling_config = ScalingConfig(num_workers=num_workers,\n",
    "                                   use_gpu=False,\n",
    "                                   resources_per_worker={\"CPU\": 1, \"HPU\": 1})\n",
    "\n",
    "    # Set backend to hccl in TorchConfig\n",
    "    torch_config = TorchConfig(backend=\"hccl\")\n",
    "\n",
    "    ray.init()\n",
    "\n",
    "    # Initialize a Ray TorchTrainer\n",
    "    trainer = TorchTrainer(\n",
    "        train_loop_per_worker=pretrain_llama,\n",
    "        train_loop_config=pretrain_config,\n",
    "        torch_config=torch_config,\n",
    "        scaling_config=scaling_config\n",
    "    )\n",
    "\n",
    "    result = trainer.fit()\n",
    "    print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start Training\n",
    "\n",
    "Finally, we call the `main` function to start the pre-training process.\n",
    "\n",
    "Before calling `main` function, you must set some environment variables.\n",
    "\n",
    "1. The visiable devices. Environment variable `HABANA_VISIBLE_DEVICES` and `HABANA_VISIBLE_MODULES` are used to control the HPU device visiable to application, you must set this two environment variable properly. For more detail usage of `HABANA_VISIBLE_DEVICES`, `HABANA_VISIBLE_MODULES`, please visit [here](https://docs.habana.ai/en/latest/PyTorch/Reference/PT_Multiple_Tenants_on_HPU/Multiple_Dockers_each_with_Single_Workload.html)\n",
    "\n",
    "2. The execution mode. Different execution mode has different runtime performance. The default execution mode is lazy mode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set some environment variables\n",
    "os.environ[\"RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES\"] = \"0\"\n",
    "# if using RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES env var\n",
    "# you must set HABANA_VISIBLE_MODULES, such as\n",
    "# os.environ[\"HABANA_VISIBLE_MODULES\"] = \"0,1,2,3\"\n",
    "\n",
    "# execution_mode are [\"lazy\", \"eager\", \"eager.compile\"]\n",
    "execution_mode = \"lazy\"\n",
    "os.environ[\"PT_HPU_LAZY_MODE\"] = \"1\" if execution_mode == \"lazy\" else \"0\"\n",
    "\n",
    "main(num_workers=4, execution_mode=execution_mode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Possible outputs\n",
    "\n",
    "``` text\n",
    "\n",
    "...\n",
    "\n",
    "RayTrainWorker pid=36561) Setting up process group for: env:// [rank=0, world_size=4]\n",
    "(TorchTrainer pid=36054) Started distributed worker processes: \n",
    "(TorchTrainer pid=36054) - (node_id=409da2dba1dc3e5b8e58a2b766a4a19d90e7879c28c2fb13644148b8, ip=100.83.111.228, pid=36561) world_rank=0, local_rank=0, node_rank=0\n",
    "(TorchTrainer pid=36054) - (node_id=409da2dba1dc3e5b8e58a2b766a4a19d90e7879c28c2fb13644148b8, ip=100.83.111.228, pid=36562) world_rank=1, local_rank=1, node_rank=0\n",
    "(TorchTrainer pid=36054) - (node_id=409da2dba1dc3e5b8e58a2b766a4a19d90e7879c28c2fb13644148b8, ip=100.83.111.228, pid=36563) world_rank=2, local_rank=2, node_rank=0\n",
    "(TorchTrainer pid=36054) - (node_id=409da2dba1dc3e5b8e58a2b766a4a19d90e7879c28c2fb13644148b8, ip=100.83.111.228, pid=36564) world_rank=3, local_rank=3, node_rank=0\n",
    "\n",
    "...\n",
    "\n",
    "(RayTrainWorker pid=36561) ============================= HABANA PT BRIDGE CONFIGURATION =========================== \n",
    "(RayTrainWorker pid=36561)  PT_HPU_LAZY_MODE = 1\n",
    "(RayTrainWorker pid=36561)  PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024\n",
    "(RayTrainWorker pid=36561)  PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807\n",
    "(RayTrainWorker pid=36561)  PT_HPU_LAZY_ACC_PAR_MODE = 0\n",
    "(RayTrainWorker pid=36561)  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0\n",
    "(RayTrainWorker pid=36561)  PT_HPU_EAGER_PIPELINE_ENABLE = 1\n",
    "(RayTrainWorker pid=36561)  PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1\n",
    "(RayTrainWorker pid=36561)  PT_HPU_ENABLE_LAZY_COLLECTIVES = 0\n",
    "(RayTrainWorker pid=36561) ---------------------------: System Configuration :---------------------------\n",
    "(RayTrainWorker pid=36561) Num CPU Cores : 160\n",
    "(RayTrainWorker pid=36561) CPU RAM       : 1056374420 KB\n",
    "(RayTrainWorker pid=36561) ------------------------------------------------------------------------------\n",
    "\n",
    "...\n",
    "\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1052, 'grad_norm': 2.225008249282837, 'learning_rate': 8.26086956521739e-06, 'epoch': 2.5, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0472, 'grad_norm': 2.0701019763946533, 'learning_rate': 8.212560386473431e-06, 'epoch': 2.51, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.097, 'grad_norm': 2.119075059890747, 'learning_rate': 8.164251207729469e-06, 'epoch': 2.51, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.7035, 'grad_norm': 2.1802899837493896, 'learning_rate': 8.115942028985508e-06, 'epoch': 2.51, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.242, 'grad_norm': 1.9516953229904175, 'learning_rate': 8.067632850241547e-06, 'epoch': 2.52, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.9594, 'grad_norm': 2.0580222606658936, 'learning_rate': 8.019323671497584e-06, 'epoch': 2.52, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.3415, 'grad_norm': 2.192605495452881, 'learning_rate': 7.971014492753623e-06, 'epoch': 2.52, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.9739, 'grad_norm': 2.0198025703430176, 'learning_rate': 7.922705314009662e-06, 'epoch': 2.52, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1624, 'grad_norm': 2.0957565307617188, 'learning_rate': 7.874396135265701e-06, 'epoch': 2.53, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.9744, 'grad_norm': 2.1159448623657227, 'learning_rate': 7.82608695652174e-06, 'epoch': 2.53, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1127, 'grad_norm': 2.159834623336792, 'learning_rate': 7.777777777777777e-06, 'epoch': 2.53, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0588, 'grad_norm': 2.106534004211426, 'learning_rate': 7.729468599033817e-06, 'epoch': 2.54, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.8734, 'grad_norm': 2.445814371109009, 'learning_rate': 7.681159420289856e-06, 'epoch': 2.54, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0278, 'grad_norm': 2.0376927852630615, 'learning_rate': 7.632850241545895e-06, 'epoch': 2.54, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.9643, 'grad_norm': 2.1097891330718994, 'learning_rate': 7.584541062801932e-06, 'epoch': 2.54, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1384, 'grad_norm': 2.157325267791748, 'learning_rate': 7.536231884057972e-06, 'epoch': 2.55, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.9982, 'grad_norm': 2.230065107345581, 'learning_rate': 7.48792270531401e-06, 'epoch': 2.55, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0988, 'grad_norm': 2.355875015258789, 'learning_rate': 7.439613526570048e-06, 'epoch': 2.55, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0514, 'grad_norm': 2.1178295612335205, 'learning_rate': 7.391304347826088e-06, 'epoch': 2.56, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.9858, 'grad_norm': 2.089723825454712, 'learning_rate': 7.342995169082126e-06, 'epoch': 2.56, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1548, 'grad_norm': 2.308490753173828, 'learning_rate': 7.294685990338164e-06, 'epoch': 2.56, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0356, 'grad_norm': 1.9994627237319946, 'learning_rate': 7.246376811594203e-06, 'epoch': 2.57, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.7696, 'grad_norm': 1.9719663858413696, 'learning_rate': 7.1980676328502416e-06, 'epoch': 2.57, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0157, 'grad_norm': 2.1598856449127197, 'learning_rate': 7.1497584541062814e-06, 'epoch': 2.57, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0113, 'grad_norm': 1.997869849205017, 'learning_rate': 7.10144927536232e-06, 'epoch': 2.57, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1048, 'grad_norm': 2.099222183227539, 'learning_rate': 7.053140096618358e-06, 'epoch': 2.58, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0048, 'grad_norm': 2.100231885910034, 'learning_rate': 7.004830917874397e-06, 'epoch': 2.58, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.0302, 'grad_norm': 2.18204402923584, 'learning_rate': 6.956521739130435e-06, 'epoch': 2.58, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.7227, 'grad_norm': 2.190962553024292, 'learning_rate': 6.908212560386473e-06, 'epoch': 2.59, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.1111, 'grad_norm': 2.349518060684204, 'learning_rate': 6.859903381642513e-06, 'epoch': 2.59, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.024, 'grad_norm': 2.5497331619262695, 'learning_rate': 6.811594202898551e-06, 'epoch': 2.59, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 3.8844, 'grad_norm': 2.3125178813934326, 'learning_rate': 6.7632850241545894e-06, 'epoch': 2.59, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "(RayTrainWorker pid=36561) {'loss': 4.2208, 'grad_norm': 2.1103923320770264, 'learning_rate': 6.7149758454106285e-06, 'epoch': 2.6, 'memory_allocated (GB)': 28.87, 'max_memory_allocated (GB)': 94.26, 'total_memory_available (GB)': 94.62}\n",
    "\n",
    "...\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orphan": true
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
